An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
Authors

Abstract

Similar Publications
TST/BTD: An End-to-End Visual Recognition System
We describe a visual recognition system operating on a hand-held device. Feature selection and tracking are performed in real-time, and used to train a template-based classifier during a capture phase prompted by the user. During normal operation, the system scores objects in the field of view based on their ranking. Severe resource constraints have prompted a re-evaluation of existing algorith...
Joint CTC/attention decoding for end-to-end speech recognition
End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as a pronunciation dictionary, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attenti...
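Joint CTC/attention decoding is commonly realized as a log-linear interpolation of the two models' hypothesis scores. A minimal sketch of that combination, with an illustrative interpolation weight `lam` (the actual weight is a tuned hyperparameter, not taken from this abstract):

```python
import math

def joint_score(log_p_ctc: float, log_p_att: float, lam: float = 0.3) -> float:
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    lam is a hypothetical tuning weight in [0, 1]; 0 uses only the
    attention decoder, 1 uses only the CTC model.
    """
    return lam * log_p_ctc + (1.0 - lam) * log_p_att

# Example: two hypotheses scored by both models; pick the better joint score.
hyp_scores = {
    "hello world": joint_score(math.log(0.4), math.log(0.5)),
    "hello word":  joint_score(math.log(0.2), math.log(0.3)),
}
best = max(hyp_scores, key=hyp_scores.get)
```

In practice this scoring is applied inside beam search at each decoding step, but the per-hypothesis combination is exactly this weighted sum of log-probabilities.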
Local Monotonic Attention Mechanism for End-to-End Speech Recognition
Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target sequence. Most attentional mechanisms used today are based on a global attention property which requires a computation of a weighted summarization of the who...
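The "weighted summarization" that global attention computes is a softmax-weighted sum over all encoder states. A minimal NumPy sketch, assuming dot-product scoring (variable names and the scoring function are illustrative, not taken from the paper):

```python
import numpy as np

def global_attention(encoder_states: np.ndarray, decoder_state: np.ndarray):
    """Compute a context vector as a weighted sum over ALL encoder states.

    encoder_states: (T, d) array, one hidden state per source timestep.
    decoder_state:  (d,) current decoder hidden state (the query).
    """
    # Dot-product alignment scores, one per source timestep.
    scores = encoder_states @ decoder_state          # (T,)
    # Softmax normalization (numerically stabilized).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # (T,), sums to 1
    # Context vector: attention-weighted sum of encoder states.
    context = weights @ encoder_states               # (d,)
    return context, weights

rng = np.random.default_rng(0)
T, d = 5, 4
enc = rng.standard_normal((T, d))
dec = rng.standard_normal(d)
ctx, w = global_attention(enc, dec)
```

A local monotonic variant, as the title suggests, would instead restrict this sum to a window of encoder states near the current alignment position, avoiding the full O(T) summation at every decoding step.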
An End-to-end 3D Convolutional Neural Network for Action Detection and Segmentation in Videos
Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g. action detection and recognition) has not been that significant due to the complexity of video data and the lack of annotations. In addition, training deep neural networks on large scale video datasets is extremely computationally e...
Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention
This paper focuses on two key problems for audiovisual emotion recognition in video. One is the temporal alignment of the audio and visual streams for feature-level fusion. The other is locating and re-weighting the perception attentions in the whole audiovisual stream for better recognition. The Long Short Term Memory Recurrent Neural Network (LSTM-RNN) is employed as the main classification ...
Journal
Journal title: Proceedings of the AAAI Conference on Artificial Intelligence
Year: 2020
ISSN: 2374-3468,2159-5399
DOI: 10.1609/aaai.v34i01.5364